Abstract. Motivated by the problem of decentralized direction-tracking, we consider the general
problem of cooperative learning in multi-agent systems with time-varying connectivity and
intermittent measurements. We propose a distributed learning protocol capable of learning an unknown
vector $\mu$ from noisy measurements made independently by autonomous nodes. Our protocol is
completely distributed and able to cope with the time-varying, unpredictable nature of inter-agent
connectivity, repeated failures of nodes, and intermittent noisy measurements of $\mu$. Our main results
bound the learning speed of our protocol in terms of a novel measure of graph connectivity we call
the sieve constant of a graph.

1. Introduction. Widespread deployment of mobile sensors is expected to revolutionize
our ability to monitor and control physical environments. However, for these
networks to reach their full range of applicability they must be capable of operating in
uncertain and unstructured environments. Realizing the full potential of networked
sensor systems will require the development of protocols that are fully distributed and
adaptive in the face of persistent faults and time-varying, unpredictable environments.

Our goal in this paper is to initiate the study of cooperative multi-agent learning 
by distributed networks operating in unknown and changing environments, subject 
to node faults and failures of communication links. While our focus here is on the 
basic problem of learning an unknown vector, we hope to contribute to the development
of a broad theory of cooperative, distributed learning in such environments,
with the ultimate aim of designing sensor network protocols capable of learning and 
adaptability. 

We will study a simple, local protocol for learning a vector from intermittent
measurements and evaluate its performance in terms of the number of nodes and the
(time-varying) network structure. Our direct motivation is the problem of direction
tracking from chemical gradients. A network of mobile sensors needs to move in a
direction $\mu$ (understood as a vector on the unit circle), which none of the sensors
initially knows; however, intermittently some sensors are able to obtain a noisy sample
of $\mu$. The sensors can observe the velocity of neighboring sensors but, as the sensors
move, the set of neighbors of each sensor changes; moreover, new sensors occasionally
join the network and current sensors sometimes permanently leave the network. The
challenge is to design a protocol by means of which the sensors can adapt their velocities
based on the measurements of $\mu$ and observations of the velocities of neighboring
sensors so that every node's velocity converges to $\mu$ as fast as possible.

We will consider a natural generalization of the problem, wherein we abandon the
constraint that $\mu$ lies on the unit circle and instead consider the problem of learning
an arbitrary vector $\mu$ by a network of mobile nodes subject to time-varying (and
unpredictable) inter-agent connectivity, repeated failures of nodes, and intermittent,
noisy measurements. We will be interested in the speed at which local, distributed
protocols are able to drive every node's estimate of $\mu$ to the correct value. We will
be especially concerned with identifying the salient features of network topology that
result in good (or poor) performance.

1.1. Cooperative multi-agent learning. We begin by formally stating the
problem for a fixed number of nodes. We consider n autonomous nodes engaged in
the task of learning a vector $\mu \in \mathbb{R}^k$. At each time t = 0, 1, 2, ... we denote by
G(t) = (V(t), E(t)) the graph of inter-agent communications at time t: two nodes are
connected by an edge in G(t) if and only if they are able to exchange messages at
time t. Note that by definition the graph G(t) is undirected. If $(i,j) \in E(t)$ then we
will say that i and j are neighbors at time t. We will adopt the convention that G(t)
contains no self-loops. While we do not assume that the graphs G(t) are connected at
each time t, we do assume they satisfy a standard condition of uniform connectivity:
there exists some constant positive integer B (unknown to any of the nodes) such that
the graph sequence G(t) is B-connected, i.e., the graphs
$(\{1, \ldots, n\}, \bigcup_{t=kB+1}^{(k+1)B} E(t))$ are
connected for each integer $k \ge 0$. Intuitively, the uniform connectivity condition
means that once we take all the edges that have appeared between times kB and
(k + 1)B, the graph is connected.
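
The uniform connectivity condition can be checked mechanically on a recorded edge sequence. The following Python sketch is illustrative only: the function names `is_connected` and `is_B_connected`, the 0-based node labels, and the convention that `edge_sets[t - 1]` stores E(t) are assumptions made for this example and are not part of the paper.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def is_connected(n: int, edges: Iterable[Edge]) -> bool:
    """Return True if the undirected graph on nodes 0..n-1 is connected."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:  # depth-first search from node 0
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n


def is_B_connected(n: int, edge_sets: List[Set[Edge]], B: int) -> bool:
    """Check every complete window of B consecutive edge sets.

    Window k covers E(kB + 1), ..., E((k + 1)B); a trailing incomplete
    window is ignored (an assumption of this sketch).
    """
    for k in range(len(edge_sets) // B):
        union = set().union(*edge_sets[k * B:(k + 1) * B])
        if not is_connected(n, union):
            return False
    return True


# Three nodes whose two links are never active at the same time.
edge_sets = [{(0, 1)}, {(1, 2)}, {(0, 1)}, {(1, 2)}]
print(is_B_connected(3, edge_sets, 2), is_B_connected(3, edge_sets, 1))
```

In this toy sequence no single graph is connected, yet every window of length two is, which is the situation the condition is meant to allow.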

Each node maintains an estimate of $\mu$; we will denote the estimate of node i at
time t by $v_i(t)$. At time t, node i can update $v_i(t)$ as a function of the values $v_j(t)$ held
by all neighbors j of node i. Physically, these updates may be the result of a message
exchange or may come about through observations by each node. Occasionally, some
nodes have access to a noisy measurement

\[ \mu_i(t) = \mu + w_i(t), \]

where $w_i(t)$ is a zero-mean random vector every entry of which has variance $\sigma^2$; we
assume all vectors $w_i(t)$ are independent of each other and of all $v_j(t)$. In this case,
node i incorporates this measurement into its updated estimate $v_i(t+1)$. We will
make an assumption of uniform measurement speed, namely that at least one node
has access to a measurement every T steps.

It is useful to think of this formalization in terms of our motivating scenario, which 
is a collection of nodes - vehicles, UAVs, mobile sensors, or underwater gliders - which 
need to learn and follow a direction. Updated information about the direction arrives 
from time to time as one or more of the nodes takes measurements, and the nodes 
need a protocol by which they update their velocities $v_i(t)$ based on the measurements
and observations of the velocities of neighboring nodes. 

This formalization also describes the scenario in which a moving group of animals 
must all learn which way to go based on intermittent samples of a preferred direction 
and social interactions with near neighbors. An example is collective migration where 
high costs associated with obtaining measurements of the migration route suggest 
that the majority of individuals rely on the more accessible observations of the relative 
motion of their near neighbors when they update their own velocities [15].

1.2. Our results. We now describe the learning protocol which we analyze for
the remainder of this paper. If at time t node i does not have a measurement of $\mu$, it
moves its velocity in the direction of its neighbors:

\[ v_i(t+1) = v_i(t) + \frac{\Delta(t)}{4} \sum_{j \in N_i(t)} \frac{v_j(t) - v_i(t)}{\max(d_i(t), d_j(t))}. \tag{1.1} \]

Here $N_i(t)$ is the set of neighbors of node i, $d_i(t)$ is the cardinality of $N_i(t)$, and $\Delta(t)$
is a stepsize which we will specify later.

On the other hand, if node i does have a measurement $\mu_i(t)$, it updates as

\[ v_i(t+1) = v_i(t) + \frac{\Delta(t)}{4} \left( \mu_i(t) - v_i(t) \right) + \frac{\Delta(t)}{4} \sum_{j \in N_i(t)} \frac{v_j(t) - v_i(t)}{\max(d_i(t), d_j(t))}. \tag{1.2} \]

Intuitively, each node seeks to align its estimate $v_i(t)$ with both the measurements
it takes and the estimates of neighboring nodes. As nodes align with one another, 
information from each measurement slowly propagates throughout the system. 
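
To make the two update rules concrete, here is a minimal Python sketch of one synchronous round for scalar estimates (k = 1). It is an illustration under stated assumptions rather than a reference implementation: the names `protocol_step` and `simulate`, the fixed three-node path, the choice that the middle node measures at every step, and the stepsize exponent 0.6 are all hypothetical choices made for this example.

```python
import random


def protocol_step(v, neighbors, measurements, step):
    """One synchronous round of the updates (1.1)-(1.2) for scalar estimates.

    v            -- list of current estimates v_i(t)
    neighbors    -- dict: node -> list of neighbors at time t
    measurements -- dict: node -> noisy sample mu_i(t), only for measuring nodes
    step         -- the stepsize Delta(t)
    """
    new_v = []
    for i, vi in enumerate(v):
        di = len(neighbors[i])
        pull = sum((v[j] - vi) / max(di, len(neighbors[j])) for j in neighbors[i])
        update = vi + (step / 4.0) * pull
        if i in measurements:  # extra term of Eq. (1.2)
            update += (step / 4.0) * (measurements[i] - vi)
        new_v.append(update)
    return new_v


def simulate(steps=20000, mu=2.0, sigma=0.5, seed=0):
    """Run the protocol on a fixed path 0 - 1 - 2 where node 1 measures each step."""
    rng = random.Random(seed)
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    v = [0.0, 5.0, -3.0]
    for t in range(1, steps + 1):
        step = 1.0 / t ** 0.6  # Delta(t) = 1/t^(1 - eps) with eps = 0.4
        measurements = {1: mu + rng.gauss(0.0, sigma)}
        v = protocol_step(v, neighbors, measurements, step)
    return v


print(simulate())
```

All three estimates should end up close to the hypothetical value mu = 2.0, although only node 1 ever samples it; the other two nodes learn it purely through the neighbor-averaging term.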

Our protocol is motivated by a number of recent advances within the literature on 
multi-agent consensus. On the one hand, the weights we accord to neighboring nodes 
are based on Metropolis weights (first introduced within the context of multi-agent 
control in [6]) and are chosen because they lead to a tractable Lyapunov analysis as 
in [22]. On the other hand, we introduce a stepsize $\Delta(t)$ which we will later choose
to decay to zero with t at an appropriate speed by analogy with the recent work on
multi-agent optimization [231 I2E1 132] • 

Our protocol is also motivated by models used to analyze collective decision making
and collective motion in animal groups [13, 17]. Our time-varying stepsize rule is
similar to models of context-dependent interaction in which individuals reduce their
reliance on social cues when they are progressing towards their target [30].

We now state our main results. We will show that this distributed learning 
protocol converges to the correct answer under fairly minimal assumptions and derive 
upper bounds on the convergence rate. The first theorem states the basic convergence 
result which underpins the subsequent results. 

Theorem 1.1. If the stepsize $\Delta(t)$ is nonnegative, nonincreasing and satisfies

\[ \sum_{t=1}^{\infty} \Delta(t) = +\infty, \qquad \sum_{t=1}^{\infty} \Delta^2(t) < \infty, \qquad \sup_{t \ge 1} \frac{\Delta(t)}{\Delta(t+c)} < \infty \ \text{ for any integer } c, \]

then for any initial values $v_1(0), \ldots, v_n(0)$, we have that with probability 1

\[ \lim_{t \to \infty} v_i(t) = \mu \quad \text{for all } i. \]



Our main goal in this paper is to prove strengthened versions of Theorem 1.1 which
provide quantitative bounds on the rate at which convergence to $\mu$ takes place. We are
particularly interested in the scaling of the convergence time with the combinatorial
structure of the interconnection graphs. We will adopt the natural measure of how
far we are from convergence, namely the sum of the squared distances from the final
limit:

\[ Z(v(t)) = \sum_{i=1}^{n} \| v_i(t) - \mu \|_2^2. \]






The next theorem provides an upper bound on convergence time for a particularly 
chosen stepsize in the case when B = 1, i.e., when each graph G(t) is connected. 

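Theorem 1.2. If B = 1 and $\Delta(t) = 1/t^{1-\epsilon}$ for any $\epsilon \in (0, 1)$, then

\[ \limsup_{t \to \infty} \; t^{1-\epsilon} E[Z(v(t)) \mid v(0)] \le \frac{n \sigma^2 T^{1-\epsilon}}{2 \min_{t=1,2,\ldots} \kappa(G(t))}, \tag{1.3} \]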



where $\kappa(G)$ is a measure of graph connectivity which we call the sieve constant of a
graph, defined as follows. For a nonnegative, stochastic matrix $A \in \mathbb{R}^{n \times n}$, the sieve
constant $\kappa(A)$ is defined as

\[ \kappa(A) = \min_{m=1,\ldots,n} \ \min_{\|x\|_2 = 1} \left( x_m^2 + \sum_{k \ne l} a_{kl} (x_k - x_l)^2 \right). \]

For an undirected graph G = (V, E), the sieve constant $\kappa(G)$ denotes the sieve constant
of the Metropolis matrix, which is the stochastic matrix with

\[ a_{ij} = \begin{cases} \dfrac{1}{\max(d_i, d_j)}, & \text{if } (i,j) \in E \text{ and } i \ne j, \\[4pt] 0, & \text{if } (i,j) \notin E. \end{cases} \]
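
For small graphs the sieve constant can be evaluated numerically, since for each fixed m the inner minimization is the smallest eigenvalue of a symmetric matrix. The Python sketch below is a hedged illustration: it assumes the sum in the definition runs over ordered pairs k ≠ l and that the Metropolis weights are 1/max(d_i, d_j); the helper names `metropolis_matrix`, `symmetric_eigenvalues` and `sieve_constant` are hypothetical, and the eigenvalues are obtained with a plain cyclic Jacobi iteration so that no external library is needed.

```python
import math


def metropolis_matrix(n, edges):
    """Stochastic matrix with a_ij = 1/max(d_i, d_j) on edges (nodes 0..n-1)."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0 / max(deg[i], deg[j])
    for i in range(n):
        A[i][i] = 1.0 - sum(A[i])  # diagonal entry makes the row sum to one
    return A


def symmetric_eigenvalues(M, sweeps=100, tol=1e-24):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(M)
    A = [row[:] for row in M]
    for _ in range(sweeps):
        off = sum(A[p][q] ** 2 for p in range(n) for q in range(n) if p != q)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # right-multiply by the rotation
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # left-multiply by its transpose
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [A[i][i] for i in range(n)]


def sieve_constant(A):
    """min over m and unit x of x_m^2 + sum_{k != l} a_kl (x_k - x_l)^2."""
    n = len(A)
    Q = [[0.0] * n for _ in range(n)]  # quadratic form of the pairwise sum
    for k in range(n):
        for l in range(n):
            if k != l:
                Q[k][k] += A[k][l]
                Q[l][l] += A[k][l]
                Q[k][l] -= A[k][l]
                Q[l][k] -= A[k][l]
    best = float("inf")
    for m in range(n):
        M = [row[:] for row in Q]
        M[m][m] += 1.0  # the x_m^2 term
        best = min(best, min(symmetric_eigenvalues(M)))
    return best


print(sieve_constant(metropolis_matrix(2, [(0, 1)])))          # single edge
print(sieve_constant(metropolis_matrix(3, [(0, 1)])))          # isolated node
print(sieve_constant(metropolis_matrix(3, [(0, 1), (1, 2)])))  # path on 3 nodes
```

Under these assumptions the disconnected example evaluates to zero up to rounding and the connected ones are strictly positive, in line with the positivity property of the sieve constant proved in Section 2.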



To parse the statement of Theorem 1.2, observe that the right-hand side of Eq.
(1.3) does not depend on t. Consequently, assuming a positive but very small $\epsilon$ is
chosen, the theorem states that the expected squared error $E[Z(v(t)) \mid v(0)]$ asymptotically
decays nearly as fast as 1/t. The bulk of the theorem statement describes the
constant in front of the nearly linear decay, which depends on the problem parameters,
namely the number of nodes n, the sampling period T, the variance of the noise
$\sigma^2$, and the graph interconnection sequence G(t). Crucially, the only influence that
the structure of the graphs G(t) has on this bound is in terms of the inverse of sieve
constants $\kappa(G(t))$.

This naturally raises the question of how large $1/\kappa(G)$ can be for an undirected
graph G on n nodes, and how it depends on the connectivity properties of the graph
G. We answer this question in the following theorem.

Theorem 1.3. For any undirected connected graph G,

\[ \frac{1}{\kappa(G)} \le n \, d_{\max} D, \]

where $d_{\max}$ is the largest degree of a node in G and D is the diameter of G.

Taken together, the previous two theorems imply upper bounds on the amount 
of time it takes for $E[Z(v(t)) \mid v(0)]$ to shrink below $\epsilon Z(0)$. Crucially, these upper
bounds are polynomial in terms of the number of nodes n. We note, also, that our 
bounds can be much better (potentially by orders of magnitude) when diameter is low 
and degrees are small. Intuitively, a low diameter ensures that information propagates 
through few hops from node to node while lower degrees ensure that new pieces of 
information have more influence in nearest-neighbor interactions. 

The preceding theorem admits a generalization to the case of general B, in which
case the convergence rate is naturally bounded in terms of the sieve constant of the
line graph.

Theorem 1.4. If $\Delta(t) = 1/t^{1-\epsilon}$ then

hm sup t L e E[Z{v(t)) | v{0)\ < — , 

where $L_n$ is the undirected line graph on n nodes.

We will later sketch how these theorems may be adapted to the setting in which 
the number of nodes is not fixed but rather nodes can join and permanently depart 
from the system. 

1.3. Related work. We believe that our paper is the first to derive rigorous 
results for the problem of cooperative multi-agent learning by a network subject to 
node failures, communication disruptions, and intermittent measurements. The key 
features of our model are 1) its cooperative nature (many nodes working together) 2) 
its reliance only on distributed and local observations 3) the incorporation of time- 
varying communication restrictions 4) its flexibility with respect to node faults (our 
results carry over to the setting in which nodes are allowed to enter and leave the 
system). These features are typically required in modern cyber-physical systems. 

Naturally, our work is not the first attempt to fuse learning algorithms with 
distributed control or multi-agent settings. Indeed, the study of learning in games is 
a classic subject which has attracted considerable attention within the last couple of 
decades due in part to its applications to multi-agent systems. We refer the reader to 
the recent papers 0IS[3[IS[9j[Zl[Il[21][lJ[2O]as well as the classic works PS [12] 
which study multi-agent learning in a game-theoretic context. Moreover, the related 
problem of distributed reinforcement learning has attracted some recent attention; 
we refer the reader to [T51 [551 ■ We make no attempt to survey these literatures 
here and refer the reader to the references in the above papers, as well as the surveys 

nam]. 

Finally, we note that much of the recent literature in distributed robotics has 
focused on distributed algorithms robust to faults and communication link failures. 
The number of works is once again too vast to survey, but we refer the reader to 
the representative papers [31 119) . Our work here is very much in the spirit of that 
literature. 

1.4. Outline. The remainder of this paper is organized as follows. Theorems 
1-4 are proved in Section 2; moreover, Section 2 concludes with an extended remark
sketching how our results can be carried over to settings where the number of nodes
itself varies with time. We compute the sieve constant for a variety of graphs in
Section 3. Section 4 provides some practical simulations of our learning protocol and
Section 5 concludes with a summary of our results and a list of several open questions.

2. Proofs of the main results. The purpose of this section is to prove Theo- 
rems 1-4. In the course of this, we will demonstrate how the sieve constant naturally 
appears in the analysis of our learning protocol. We begin with some preliminary 
definitions. 

2.1. Definitions. Given a nonnegative matrix $A \in \mathbb{R}^{n \times n}$, we will use G(A)
to denote the graph whose edges correspond to the positive entries of A in the following
way: G(A) is the directed graph on the vertices {1, 2, ..., n} with edge set
$\{(i,j) \mid a_{ji} > 0\}$. Note that if A is symmetric then the graph G(A) will be undirected.
We will use the standard convention of $e_i$ to mean the i'th basis column
vector and $\mathbf{1}$ to mean the all-ones vector. Finally, we will use $r_i(A)$ to denote the row
sum of the i'th row of $A^2$ and $R(A) = \mathrm{diag}(r_1(A), \ldots, r_n(A))$. When the argument
matrix A is clear from context, we will simply write $r_i$ and R for $r_i(A)$, R(A).

2.2. A few preliminary lemmas. In this subsection we prove a few lemmas 
which we will find useful in the proofs of our main theorems. Our first lemma gives a 
decomposition of a symmetric matrix and its immediate corollary provides a way to 
bound the change in norm arising from multiplication by a symmetric matrix. Similar 
statements were proved in |B],[22], and [3"T] . 

Lemma 2.1. For any symmetric matrix A,

\[ A^2 = R - \sum_{k<l} [A^2]_{kl} (e_k - e_l)(e_k - e_l)^T. \]

Proof. Observe that each term $(e_k - e_l)(e_k - e_l)^T$ in the sum on the right-hand
side has row sums of zero, and consequently both sides of the above equation have
identical row sums. Moreover, both sides of the above equation are symmetric. This
implies it suffices to prove that all the (i, j)-entries of both sides with i < j are the
same. But on both sides, the (i, j)'th element when i < j is $[A^2]_{ij}$. □

This lemma may be used to bound how much the norm of a vector changes after 
multiplication by a symmetric matrix. 

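Corollary 2.2. For any symmetric matrix A,

\[ \|Ax\|_2^2 = \|x\|_2^2 - \sum_{j=1}^{n} (1 - r_j) x_j^2 - \sum_{k<l} [A^2]_{kl} (x_k - x_l)^2. \]

Proof. By Lemma 2.1,

\begin{align*}
\|Ax\|_2^2 &= x^T A^2 x \\
&= x^T R x - \sum_{k<l} [A^2]_{kl} \, x^T (e_k - e_l)(e_k - e_l)^T x \\
&= \sum_{j=1}^{n} r_j x_j^2 - \sum_{k<l} [A^2]_{kl} (x_k - x_l)^2.
\end{align*}

Thus the decrease in squared norm from x to Ax is

\[ \|x\|_2^2 - \|Ax\|_2^2 = \sum_{j=1}^{n} (1 - r_j) x_j^2 + \sum_{k<l} [A^2]_{kl} (x_k - x_l)^2. \]

□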
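
The identity in Corollary 2.2 is easy to sanity-check numerically. The short Python sketch below evaluates both sides for a random symmetric matrix; the function name `corollary_sides`, the matrix size and the random seed are arbitrary choices made for this illustration.

```python
import random


def corollary_sides(A, x):
    """Return (||Ax||^2, right-hand side of Corollary 2.2) for symmetric A."""
    n = len(A)
    A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    r = [sum(row) for row in A2]  # r_j: row sums of A^2
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    lhs = sum(t * t for t in Ax)
    rhs = (sum(t * t for t in x)
           - sum((1.0 - r[j]) * x[j] ** 2 for j in range(n))
           - sum(A2[k][l] * (x[k] - x[l]) ** 2
                 for k in range(n) for l in range(k + 1, n)))
    return lhs, rhs


rng = random.Random(1)
n = 5
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = rng.uniform(-1.0, 1.0)  # random symmetric matrix
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]

lhs, rhs = corollary_sides(A, x)
print(abs(lhs - rhs))  # difference should be at the level of rounding error
```

Note that the check does not require A to be nonnegative or stochastic; symmetry is the only hypothesis used.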

We now prove a basic positivity property for the sieve constant of a stochastic
matrix (defined in Section 1), which we will have occasion to use in the next subsection.

Lemma 2.3. $\kappa(A) \ge 0$, and $\kappa(A) > 0$ if and only if the graph G(A) is weakly
connected. (A directed graph is weakly connected if the undirected graph obtained by
ignoring the orientations of the edges is connected.)

Proof. It is evident from the definition of $\kappa(A)$ that it is necessarily nonnegative.
If G(A) is not weakly connected, then we can pick m as any vertex, set $x_i = 0$
on the connected component containing m, and $x_i = c$ on all the other connected
components; here c is a positive constant chosen so that the normalization condition
$\|x\|_2 = 1$ is satisfied. These choices of m and x result in $\kappa(A) = 0$.

On the other hand, suppose that G(A) is weakly connected. Note that $\kappa(A) = 0$
implies that $x_m = 0$ and that every pair i, j with $a_{ij} > 0$ or $a_{ji} > 0$ satisfies
$x_i = x_j$. Now the weak connectivity of G(A) implies that every entry of x must be
identical. Since $x_m = 0$, we in fact have that x is the zero vector, which contradicts
the normalization condition $\|x\|_2 = 1$. Consequently, $\kappa(A) = 0$ is not possible if G(A)
is weakly connected. □

2.3. The sieve constant and the learning protocol. With the above lemmas
in place, we can now turn to the analysis of our learning protocol. For the remainder
of Subsection 2.3 we will assume that k = 1, i.e., $\mu$ and all $v_i(t)$ belong to $\mathbb{R}$. We will
then define v(t) to be the vector that stacks up $v_1(t), \ldots, v_n(t)$.

The following proposition describes a convenient way to write Eq. (1.1) and Eq. (1.2). We omit
the proof (which is obvious).

Proposition 2.4. We can rewrite Eq. (1.1) and Eq. (1.2) as follows:

\begin{align*}
y(t+1) &= A(t) v(t) + b(t), \\
q(t+1) &= (1 - \Delta(t)) v(t) + \Delta(t) y(t+1), \\
v(t+1) &= q(t+1) + \Delta(t) r(t),
\end{align*}

where:

1. If $i \ne j$ and i, j are neighbors in G(t),

\[ a_{ij}(t) = \frac{1}{4 \max(d_i(t), d_j(t))}. \]

However, if $i \ne j$ are not neighbors in G(t), then $a_{ij}(t) = 0$. As a consequence,
A(t) is a symmetric matrix.
2. If node i does not have a measurement of $\mu$ at time t, then

\[ a_{ii}(t) = 1 - \frac{1}{4} \sum_{j \in N_i(t)} \frac{1}{\max(d_i(t), d_j(t))}. \]

On the other hand, if node i does have a measurement of $\mu$ at time t,

\[ a_{ii}(t) = \frac{3}{4} - \frac{1}{4} \sum_{j \in N_i(t)} \frac{1}{\max(d_i(t), d_j(t))}. \]

Thus A(t) is a diagonally dominant matrix and its graph is merely the
intercommunication graph at time t: G(A(t)) = G(t). Moreover, if no node has a
measurement at time t, A(t) is stochastic.
3. If node i does not have a measurement of $\mu$ at time t, then $b_i(t) = 0$. If node
i does have a measurement of $\mu$ at time t, then $b_i(t) = (1/4)\mu$.
4. If node i has a measurement of $\mu$ at time t, $r_i(t)$ is a random variable with
mean zero and variance $\sigma^2/16$. Else, $r_i(t) = 0$. Each $r_i(t)$ is independent of
all v(t).
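
As a small illustration of Proposition 2.4, the following Python sketch assembles A(t) and b(t) for a given neighbor structure and measurement set. The function name `build_A_b`, the list-of-lists graph encoding and the example values are assumptions made for this sketch only.

```python
def build_A_b(neighbors, measuring, mu):
    """Assemble A(t) and b(t) as described in Proposition 2.4 (scalar case).

    neighbors -- neighbors[i] is the list of neighbors of node i at time t
    measuring -- set of nodes that have a measurement at time t, i.e. S(t)
    mu        -- the unknown scalar (known only to the analyst, not the nodes)
    """
    n = len(neighbors)
    deg = [len(neighbors[i]) for i in range(n)]
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i in range(n):
        for j in neighbors[i]:
            A[i][j] = 1.0 / (4.0 * max(deg[i], deg[j]))
        off_diag = sum(A[i])  # diagonal entry is still zero at this point
        if i in measuring:
            A[i][i] = 0.75 - off_diag
            b[i] = mu / 4.0
        else:
            A[i][i] = 1.0 - off_diag
    return A, b


# Path 0 - 1 - 2 in which only the middle node measures.
A, b = build_A_b([[1], [0, 2], [1]], {1}, 2.0)
row_sums = [sum(row) for row in A]
print(row_sums, b)
# The all-mu vector is a fixed point of y = A v + b, as used in the proof of Lemma 2.7.
print([row_sums[i] * 2.0 + b[i] for i in range(3)])
```

Rows of measuring nodes sum to 3/4 and all other rows sum to one, which is the property the subsequent lemmas rely on.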



Definition 2.5. Let S(t) be the set of nodes with a measurement of $\mu$ at time t
and let s(t) be the cardinality of S(t).

The following pair of lemmas lower bound the decrease in Z(v(t)) from time t to
t + 1. It turns out that we need two distinct lemmas to handle two cases: Lemma 2.6
gives a bound in the case of s(t) = 0 while Lemma 2.7 handles the case when s(t) > 0.

Lemma 2.6. If s(t) = 0, then

\[ Z(v(t+1)) \le Z(v(t)) - \frac{\Delta(t)}{8} \sum_{(k,l) \in E(t)} \frac{(v_k(t) - v_l(t))^2}{\max(d_k(t), d_l(t))}. \]

Proof. By Proposition 2.4, if s(t) = 0 then v(t + 1) = U(t)v(t), where
$U(t) = (1 - \Delta(t))I + \Delta(t)A(t)$ is a symmetric and stochastic matrix. Therefore

\[ v(t+1) - \mu\mathbf{1} = U(t)(v(t) - \mu\mathbf{1}). \]

We apply Corollary 2.2 to obtain

\[ \|v(t+1) - \mu\mathbf{1}\|_2^2 = \|v(t) - \mu\mathbf{1}\|_2^2 - \sum_{k<l} [U^2(t)]_{kl} (v_k(t) - v_l(t))^2. \]

We next lower bound $[U^2(t)]_{kl}$ as follows: if $(k,l) \in E(t)$, then
$a_{kl}(t) \ge 1/(4\max(d_k(t), d_l(t)))$ and consequently
$[U(t)]_{kl} \ge \Delta(t)/(4\max(d_k(t), d_l(t)))$. Moreover, since A(t) is diagonally
dominant so is U(t), and consequently $[U^2(t)]_{kl} \ge \Delta(t)/(8\max(d_k(t), d_l(t)))$. This
now immediately implies the statement of the lemma. □

Lemma 2.7. If s(t) > 0 and $\Delta(t) \in (0, 1)$ then

\begin{align*}
E[Z(v(t+1)) \mid v(t)] &\le Z(v(t)) - \frac{\Delta(t)}{8} \sum_{(k,l)\in E(t)} \frac{(v_k(t) - v_l(t))^2}{\max(d_k(t), d_l(t))} - \frac{\Delta(t)}{4} \sum_{k \in S(t)} (v_k(t) - \mu)^2 + \frac{\Delta(t)^2}{16} \sigma^2 s(t) \\
&\le \left( 1 - \frac{1}{8} \Delta(t) \kappa[G(t)] \right) Z(v(t)) + s(t) \frac{\Delta(t)^2 \sigma^2}{16}.
\end{align*}

Proof. Observe that, for any t, the vector $\mu\mathbf{1}$ satisfies

\[ \mu\mathbf{1} = A(t) \mu\mathbf{1} + b(t), \]

and therefore,

\[ y(t+1) - \mu\mathbf{1} = A(t)(v(t) - \mu\mathbf{1}). \tag{2.1} \]

We now apply Corollary 2.2, which involves the entries and row sums of the matrix
$A^2(t)$, which we lower-bound as follows. Because A(t) is diagonally dominant and
nonnegative, we have that if $(k,l) \in E(t)$ then $[A^2]_{kl}(t) \ge 1/(8\max(d_k(t), d_l(t)))$.
Moreover, if k has a measurement of $\mu$ then the row sum of the k'th row of A equals
3/4, which implies that the k'th row sum of $A^2$ is at most 3/4. Consequently, Corollary
2.2 implies

\[ \|y(t+1) - \mu\mathbf{1}\|_2^2 \le Z(v(t)) - \frac{1}{8} \sum_{(k,l)\in E(t)} \frac{(v_k(t) - v_l(t))^2}{\max(d_k(t), d_l(t))} - \frac{1}{4} \sum_{k \in S(t)} (v_k(t) - \mu)^2. \tag{2.2} \]

Next, since $\Delta(t) \in (0, 1)$ we can appeal to the convexity of the squared two-norm to
obtain

\[ \|q(t+1) - \mu\mathbf{1}\|_2^2 \le Z(v(t)) - \frac{\Delta(t)}{8} \sum_{k<l,\ (k,l)\in E(t)} \frac{(v_k(t) - v_l(t))^2}{\max(d_k(t), d_l(t))} - \frac{\Delta(t)}{4} \sum_{k \in S(t)} (v_k(t) - \mu)^2. \]

Since $E[r(t)] = 0$ and $E[\|r(t)\|_2^2] = s(t)\sigma^2/16$ independently of v(t), this immediately
implies the first inequality in the statement of the lemma.

To establish the second inequality, we just need to show that 

(k,i)eE(t) v KV h u " keS(t) 

and the rest of the proof proceeds exactly as after Eq. (2.2). However, the above
inequality follows immediately from the definition of $\kappa(G(t))$ after making the change
of variables $x_i(t) = v_i(t) - \mu$. □

2.4. Proofs of the main theorems. With the previous pair of lemmas in place, 
we now go on to provide proofs of Theorems 1-4. Note that it suffices to prove each of 
these theorems in the case of k = 1; once that case is proven, we can apply the theorem 
to each component of the vector $v_i(t)$ to obtain the general case. Consequently, we
will go on assuming that k = 1 as in the previous subsection.

Proof of Theorem 1.1. We may assume without loss of generality that
$\Delta(t) \in (0,1)$ since this is eventually true due to the monotonicity and square summability
of $\Delta(t)$. We first claim that there exists some constant c > 0 such that if
$t_k = 1 + k\max(T,B)$, then

\[ E[Z(v(t_{k+1})) \mid v(t_k)] \le (1 - c\Delta(t_{k+1})) Z(v(t_k)) + n \max(T,B) \Delta(t_k)^2 \sigma^2. \tag{2.3} \]

We postpone the proof of this claim for a few lines while we observe that, as a
consequence of our assumptions on $\Delta(t)$, we have the following three facts:

\[ \sum_{k=1}^{\infty} c\Delta(t_{k+1}) = +\infty, \qquad \sum_{k=1}^{\infty} n \max(T,B) \Delta(t_k)^2 \sigma^2 < \infty, \qquad \lim_{k\to\infty} \frac{n \max(T,B) \Delta(t_k)^2 \sigma^2}{c\Delta(t_{k+1})} = 0. \]

Now Lemma 10 from Chapter 2.2 of [25] implies that $\lim_{t\to\infty} Z(v(t)) = 0$ with
probability 1.

To conclude the proof, it remains to demonstrate Eq. (2.3). We can apply
Lemmas 2.6 and 2.7 to obtain a bound on $E[Z(v(t+1)) \mid v(t)]$ for every time t
between $t_k$ and $t_{k+1}$. Putting these bounds together and taking expectations, we
obtain that

\begin{align*}
E[Z(v(t_{k+1})) \mid v(t_k)] \le Z(v(t_k)) &- \sum_{m=t_k}^{t_{k+1}-1} \left( \frac{\Delta(m)}{8} \sum_{(k,l)\in E(m)} \frac{E[(v_k(m) - v_l(m))^2 \mid v(t_k)]}{\max(d_k(m), d_l(m))} + \frac{\Delta(m)}{4} \sum_{k\in S(m)} E[(v_k(m) - \mu)^2 \mid v(t_k)] \right) \\
&+ \sum_{m=t_k}^{t_{k+1}-1} \frac{\Delta^2(m)}{16} \sigma^2 s(m).
\end{align*}

Consequently, Eq. (2.3) follows from the assertion

\[ \inf \frac{\sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in G(m)} E[(v_k(m) - v_l(m))^2 \mid v(t_k)] + \sum_{k\in S(m)} E[(v_k(m) - \mu)^2 \mid v(t_k)] \right)}{\sum_{i=1}^{n} (v_i(t_k) - \mu)^2} > 0, \]

where the infimum is taken over all vectors $v(t_k)$ such that $v(t_k) \ne \mu\mathbf{1}$ and over
all possible sequences of undirected intercommunication graphs and measurements
between time $t_k$ and $t_{k+1} - 1$ satisfying the conditions of uniform connectivity and
uniform measurement speed. Now since $E[X^2] \ge E[X]^2$, we have that

\begin{align*}
&\inf \frac{\sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in G(m)} E[(v_k(m) - v_l(m))^2 \mid v(t_k)] + \sum_{k\in S(m)} E[(v_k(m) - \mu)^2 \mid v(t_k)] \right)}{\sum_{i=1}^{n} (v_i(t_k) - \mu)^2} \\
&\qquad \ge \inf \frac{\sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in G(m)} E[v_k(m) - v_l(m) \mid v(t_k)]^2 + \sum_{k\in S(m)} E[v_k(m) - \mu \mid v(t_k)]^2 \right)}{\sum_{i=1}^{n} (v_i(t_k) - \mu)^2}.
\end{align*}

We will complete the proof by arguing that this last infimum is positive.

Let us define $z(t) = E[v(t) - \mu\mathbf{1} \mid v(t_k)]$ for $t \ge t_k$. From Proposition 2.4 and Eq.
(2.1), we can work out the dynamics satisfied by the sequence z(t) for $t \ge t_k$:

\begin{align*}
z(t+1) &= E[v(t+1) - \mu\mathbf{1} \mid v(t_k)] \\
&= E[q(t+1) - \mu\mathbf{1} \mid v(t_k)] \\
&= E[(1 - \Delta(t)) v(t) + \Delta(t) y(t+1) - \mu\mathbf{1} \mid v(t_k)] \\
&= E[(1 - \Delta(t))(v(t) - \mu\mathbf{1}) \mid v(t_k)] + E[\Delta(t)(y(t+1) - \mu\mathbf{1}) \mid v(t_k)] \\
&= E[(1 - \Delta(t))(v(t) - \mu\mathbf{1}) \mid v(t_k)] + E[\Delta(t) A(t)(v(t) - \mu\mathbf{1}) \mid v(t_k)] \\
&= [(1 - \Delta(t)) I + \Delta(t) A(t)] z(t). \tag{2.5}
\end{align*}



Clearly, we need to argue that

\[ \inf \frac{\sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in G(m)} (z_k(m) - z_l(m))^2 + \sum_{k\in S(m)} z_k^2(m) \right)}{\sum_{i=1}^{n} z_i^2(t_k)} > 0, \tag{2.6} \]

where the infimum is taken over all sequences of undirected intercommunication
graphs satisfying the conditions of uniform connectivity and measurement speed and
all nonzero $z(t_k)$ (which in turn determines all the z(t) with $t > t_k$ through Eq. (2.5)).

From Eq. (2.5), we have that the expression within the infimum in Eq. (2.6)
is invariant under scaling of $z(t_k)$. So we can conclude that the infimum is achieved
by some vector $z(t_k)$ with $\|z(t_k)\|_2 = 1$.

Now suppose $z(t_k)$ is a vector that achieves this infimum; let $S_+ \subset \{1, \ldots, n\}$ be
the set of indices i with $z_i(t_k) > 0$, $S_-$ be the set of indices i with $z_i(t_k) < 0$, and $S_0$
be the set of indices with $z_i(t_k) = 0$. Since $\|z(t_k)\|_2 = 1$ we have that at least
one of $S_+$, $S_-$ is nonempty. Without loss of generality, suppose that $S_-$ is nonempty.
Due to the conditions of uniform connectivity and uniform measurement speed, there
is a first time $t' < t_{k+1}$ when at least one of the following two events happens: (i)
some node $i' \in S_-$ is connected to a node $j' \in S_0 \cup S_+$; (ii) some node $i' \in S_-$ has a
measurement of $\mu$.

In the former case, $z_{i'}(t') < 0$ and $z_{j'}(t') \ge 0$ and consequently $(z_{i'}(t') - z_{j'}(t'))^2$
will be positive; in the latter case, $z_{i'}(t') < 0$ and consequently $z_{i'}^2(t')$ will be positive.
In either case, the infimum of Eq. (2.6) will be strictly positive. □

Having proved Theorem 1.1, we next turn to the proof of Theorem 1.2. A crucial
part in the proof will be played by the following lemma, which is a modification of a
lemma from [11].

Lemma 2.8. Suppose the sequence $b_k$ satisfies

\[ b_{k+1} \le \left( 1 - \frac{c_k}{k^{1-\epsilon}} \right) b_k + \frac{d_k}{k^{2-2\epsilon}}, \]

where $\epsilon \in (0, 1)$ and $c_k$, $d_k$ are positive sequences and $c_k$ is bounded away from zero.
Then

\[ \limsup_{k \to \infty} \; k^{1-\epsilon} b_k \le \sup_k \frac{d_k}{c_k}. \]

Proof. Let $q = \sup_k d_k/c_k$; we may assume that $q < \infty$ because if q is infinite
then the lemma is trivially true. Fix $\delta > 0$ and let $q(\delta)$ be a positive number such that
$d'_k = c_k q(\delta)$ satisfies $d'_k \ge d_k + \delta$ for all k; such a $q(\delta)$ exists due to the assumptions
that $\sup_k d_k/c_k < \infty$ and that $c_k$ is bounded away from zero. Because for fixed
$\epsilon \in (0, 1)$ we have $1/k^{1-\epsilon} - 1/(k+1)^{1-\epsilon} = O(1/k^{2-\epsilon})$, it follows that for large enough
k,

\[ \frac{d_k}{k^{2-2\epsilon}} \le \frac{c_k q(\delta)}{k^{2-2\epsilon}} + \frac{q(\delta)}{(k+1)^{1-\epsilon}} - \frac{q(\delta)}{k^{1-\epsilon}}, \]

and therefore for large enough k,

\[ b_{k+1} - \frac{q(\delta)}{(k+1)^{1-\epsilon}} \le \left( 1 - \frac{c_k}{k^{1-\epsilon}} \right) \left( b_k - \frac{q(\delta)}{k^{1-\epsilon}} \right). \]

Because $\epsilon \in (0, 1)$ and $c_k$ is positive and bounded away from zero this implies that
$b_k - q(\delta)/k^{1-\epsilon}$ approaches zero faster than the inverse of any polynomial in k. Therefore,

\[ \limsup_{k} \; k^{1-\epsilon} b_k \le q(\delta). \]

Since this is true for any $\delta > 0$ we have that

\[ \limsup_{k} \; k^{1-\epsilon} b_k \le q. \]

□
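
The conclusion of Lemma 2.8 can be illustrated numerically by iterating the recursion with equality and constant coefficients. The Python sketch below is only a sanity check under assumed values: the name `scaled_limit` and the particular constants c = 0.5, d = 2, $\epsilon$ = 0.5 and starting value are hypothetical choices for this example.

```python
def scaled_limit(c=0.5, d=2.0, eps=0.5, steps=200_000, b1=10.0):
    """Iterate b_{k+1} = (1 - c/k^(1-eps)) b_k + d/k^(2-2eps) with equality.

    Returns k^(1-eps) * b_k at the final index k = steps + 1, which Lemma 2.8
    predicts is asymptotically at most d/c when c_k = c and d_k = d.
    """
    b = b1
    for k in range(1, steps + 1):
        b = (1.0 - c / k ** (1.0 - eps)) * b + d / k ** (2.0 - 2.0 * eps)
    return (steps + 1) ** (1.0 - eps) * b


print(scaled_limit(), 2.0 / 0.5)  # scaled value versus the bound d/c
```

With these constants the scaled sequence settles near d/c = 4, which suggests that the bound in the lemma is attained when the recursion holds with equality.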

With this lemma in place, we now proceed to the proof of Theorem 1.2. The
proof has many parallels to the proof of Theorem 1.1, in particular in that it relies on
the repeated application of Lemmas 2.6 and 2.7 to produce a recursion which bounds
$E[Z(v(t)) \mid v(t_k)]$. Lemma 2.8, which we have just proved, will be used to obtain a
precise convergence time decay out of this recursion.

Proof of Theorem 1.2. Let $t_k$ be one plus the time when the k'th measurement
occurs; naturally, $t_k \le 1 + kT$. Lemmas 2.6 and 2.7 imply that

\[ E[Z(v(t_{k+1})) \mid v(t_k)] \le \left( 1 - \frac{1}{8} \Delta(t_{k+1}) \kappa[G(t_{k+1})] \right) Z(v(t_k)) + s(t_{k+1}) \Delta(t_{k+1})^2 \frac{\sigma^2}{16}. \]

Tautologically $\kappa(G(t)) \ge \min_t \kappa(G(t))$, and therefore

\[ E[Z(v(t_{k+1})) \mid v(t_k)] \le \left( 1 - \frac{1}{8} \Delta(t_{k+1}) \min_t \kappa[G(t)] \right) Z(v(t_k)) + n \Delta(t_{k+1})^2 \frac{\sigma^2}{16}. \]

Plugging in $\Delta(t) = 1/t^{1-\epsilon}$, we obtain

\[ E[Z(v(t_{k+1})) \mid v(t_k)] \le \left( 1 - \frac{(1/8) \min_t \kappa[G(t)]}{t_{k+1}^{1-\epsilon}} \right) Z(v(t_k)) + \frac{n\sigma^2/16}{t_{k+1}^{2-2\epsilon}}. \]

Let $\tau_k = k/t_{k+1}$. We may rewrite the above relation as

\[ E[Z(v(t_{k+1})) \mid v(t_k)] \le \left( 1 - \frac{\tau_k^{1-\epsilon} (1/8) \min_t \kappa[G(t)]}{k^{1-\epsilon}} \right) Z(v(t_k)) + \frac{\tau_k^{2-2\epsilon} n\sigma^2/16}{k^{2-2\epsilon}}. \]

Taking expectations of both sides,

\[ E[Z(v(t_{k+1})) \mid v(0)] \le \left( 1 - \frac{\tau_k^{1-\epsilon} (1/8) \min_t \kappa[G(t)]}{k^{1-\epsilon}} \right) E[Z(v(t_k)) \mid v(0)] + \frac{\tau_k^{2-2\epsilon} n\sigma^2/16}{k^{2-2\epsilon}}. \]

We now argue that the assumptions of Lemma 2.8 apply. All that needs to be verified
is that $\tau_k$ is bounded above and away from zero, which is, of course, true: $1 \ge \tau_k \ge
1/(2T)$ because $k \le t_{k+1} \le (k + 1)T$. Application of Lemma 2.8 gives

\[ \limsup_{k} \; k^{1-\epsilon} E[Z(v(t_k)) \mid v(0)] \le \frac{n\sigma^2}{2 \min_t \kappa[G(t)]}. \tag{2.7} \]

Finally, to complete the proof we must pass from a statement about the decay of
$E[Z(v(t_k)) \mid v(0)]$ at the instances $t_k$ to a statement about the decay of $E[Z(v(t)) \mid v(0)]$
for general t. So suppose that prior to time t there have been m(t) instants when
a node has had a measurement. Because $E[Z(v(t+1)) \mid v(t)] \le Z(v(t))$, if t is not
among the $t_k$ then Eq. (2.7) implies

\[ \limsup_{t} \; (m(t))^{1-\epsilon} E[Z(v(t)) \mid v(0)] \le \frac{n\sigma^2}{2 \min_t \kappa[G(t)]}. \]

Since $m(t) \ge (t - 1)/T$, the last equation implies Theorem 1.2. □

We now turn to the proof of Theorem 1.4. Its proof requires a certain inequality
between quadratic forms in the vector v(t) which we separate into the following lemma.

Lemma 2.9. Let $t_k = 1 + k\max(T, B)$ and assume that the entries of the vector
$v(t_k)$ are ordered monotonically as

\[ v_1(t_k) \le v_2(t_k) \le \cdots \le v_n(t_k). \]

Further, let us assume that none of the $v_i(t_k)$ equal $\mu$, and let us define $p_-$ to be
the largest index such that $v_{p_-}(t_k) < \mu$ and $p_+$ to be the smallest index such that
$v_{p_+}(t_k) > \mu$. We then have

\begin{align*}
&\sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in E(m)} E[(v_k(m) - v_l(m))^2 \mid v(t_k)] + \sum_{k\in S(m)} E[(v_k(m) - \mu)^2 \mid v(t_k)] \right) \\
&\qquad \ge (v_{p_-}(t_k) - \mu)^2 + (v_{p_+}(t_k) - \mu)^2 + \sum_{i=1,\ldots,n-1,\ i \ne p_-} (v_i(t_k) - v_{i+1}(t_k))^2. \tag{2.8}
\end{align*}

Proof. The proof parallels a portion of the proof of Theorem 1.1. First, we change
variables by defining z(t) as

\[ z(t) = E[v(t) - \mu\mathbf{1} \mid v(t_k)] \]

for $t \ge t_k$. We claim that

\[ \sum_{m=t_k}^{t_{k+1}-1} \left( \sum_{(k,l)\in E(m)} (z_k(m) - z_l(m))^2 + \sum_{k\in S(m)} z_k^2(m) \right) \ge z_{p_-}^2(t_k) + z_{p_+}^2(t_k) + \sum_{i=1,\ldots,n-1,\ i \ne p_-} (z_i(t_k) - z_{i+1}(t_k))^2. \tag{2.9} \]

The claim immediately implies the lemma after application of the inequality $E[X^2] \ge
E[X]^2$.

Now we turn to the proof of the claim, which is similar to a proof of a lemma from
[22]. We will associate to each term on the right-hand side of Eq. (2.9) a term on the
left-hand side of Eq. (2.9), and we will argue that each term on the left-hand side is
at least as big as the sum of all terms on the right-hand side associated with it. Since
every term on the left-hand side is clearly nonnegative, this will prove the lemma.

To describe this association, we first introduce some new notation. We denote
the set $\{1, \ldots, l\}$ by $S_l$; its complement, the set $\{l + 1, \ldots, n\}$, is then $S_l^c$. If $l \ne p_-$,
we will abuse notation by saying that $S_l$ contains zero if $l \ge p_+$; else, we say that
$S_l$ does not contain zero and $S_l^c$ contains zero. However, in the case of $l = p_-$, we
will say that neither of $S_{p_-}$ and $S_{p_-}^c$ contains zero. We will say that $S_l$ "is crossed
by an edge" at time m if a node in $S_l$ is connected to a node in $S_l^c$ at time m. For
$l \ne p_-$ we will say that $S_l$ is "crossed by a measurement" at time m if a node in
whichever of $S_l$, $S_l^c$ that does not contain zero has a measurement at time m. We
will say that $S_{p_-}$ is "crossed by a measurement from the left" at time m if a node in
$S_{p_-}$ has a measurement at time m; we will say that it is "crossed by a measurement
from the right" at time m if a node in $S_{p_-}^c$ has a measurement at time m. Note
that the assumption of uniform connectivity means that every $S_l$ is crossed by an
edge at least one time $m \in \{t_k, \ldots, t_{k+1} - 1\}$. It may happen that some $S_l$ are
also crossed by measurements, but it isn't required by the uniform measurement
assumption. Nevertheless, the uniform measurement assumption implies that $S_{p_-}$
is crossed by a measurement at some time $m \in \{t_k, \ldots, t_{k+1} - 1\}$. Finally, we will
say that $S_l$ is crossed at time m if it is either crossed by an edge or crossed by a
measurement (plainly or from left or right).

We next describe how we associate terms on the right-hand side of Eq. (2.9) with
terms on the left-hand side of Eq. (2.9). Suppose $l$ is any number in $1, \ldots, n-1$
except $p_-$; consider the first time $S_l$ is crossed; let this be time $m$. If the crossing is
by an edge, then let $(i,j)$ be any edge which goes between $S_l$ and $S_l^c$ at time $m$. We
will associate $(z_l(t_k) - z_{l+1}(t_k))^2$ with $(z_i(m) - z_j(m))^2$; as a shorthand for this, we
will say that we associate $l$ with the edge $(i,j)$ at time $m$. On the other hand, if $S_l$ is
crossed by a measurement at time $m$, let $i$ be a node in whichever of $S_l, S_l^c$ does not
contain zero which has a measurement at time $m$; we associate $(z_l(t_k) - z_{l+1}(t_k))^2$ with
$z_i^2(m)$; as a shorthand for this, we will say that we associate $l$ with a measurement
by $i$ at time $m$.

Finally, we describe the associations for the terms $z_{p_-}^2(t_k)$ and $z_{p_+}^2(t_k)$, which
are more intricate. Again, let us suppose that $S_{p_-}$ is crossed first at time $m$; if
the crossing is by an edge, then we associate both these terms with any edge $(i,j)$
crossing $S_{p_-}$ at time $m$. If, however, $S_{p_-}$ is crossed first by a measurement from
the left, then we associate $z_{p_-}^2(t_k)$ with $z_i^2(m)$, where $i$ is any node in $S_{p_-}$ having a
measurement at time $m$. We then consider $u$, which is the first time $S_{p_-}$ is crossed by
either an edge or a measurement from the right; if it is crossed by an edge, then we
associate $z_{p_+}^2(t_k)$ with $(z_i(u) - z_j(u))^2$ for any edge $(i,j)$ going between $S_{p_-}$ and
$S_{p_-}^c$ at time $u$; else, we associate it with $z_i^2(u)$ where $i$ is any node in $S_{p_-}^c$ having a
measurement at time $u$. On the other hand, if $S_{p_-}$ is crossed first by a measurement
from the right, then we flip the associations: we associate $z_{p_+}^2(t_k)$ with $z_i^2(m)$, where
$i$ is any node in $S_{p_-}^c$ having a measurement at time $m$. We then consider $u$, which is
now the first time $S_{p_-}$ is crossed by either an edge or a measurement from the left; if
$S_{p_-}$ is crossed by an edge first, then we associate $z_{p_-}^2(t_k)$ with $(z_i(u) - z_j(u))^2$ for
any edge $(i,j)$ going between $S_{p_-}$ and $S_{p_-}^c$ at time $u$; else, we associate it with
$z_i^2(u)$ where $i$ is any node in $S_{p_-}$ having a measurement at time $u$.

2 If $S_l$ is crossed both by an edge and a measurement at time $m$, we will say it is crossed by an
edge first. Throughout the remainder of this proof, we keep to the convention of breaking ties in
favor of edges by saying that $S_l$ is crossed first by an edge if the first crossing was simultaneously
by both an edge and by a measurement.

We now go on to prove that every term on the left-hand side of Eq. (2.9) is at
least as big as the sum of all terms on the right-hand side of Eq. (2.9) associated with
it.

Let us first consider the terms $(z_i(m) - z_j(m))^2$ on the left-hand side of Eq.
(2.9). Suppose the edge $(i,j)$ with $i < j$ at time $m$ was associated with indices
$l_1 < l_2 < \cdots < l_r$ (i.e., with the terms $(z_{l_1}(t_k) - z_{l_1+1}(t_k))^2, \ldots, (z_{l_r}(t_k) - z_{l_r+1}(t_k))^2$).
The key observation is that if $S_l$ has not been crossed before time $m$ then

$$\max_{i=1,\ldots,l} z_i(m) \le z_l(t_k) < z_{l+1}(t_k) \le \min_{i=l+1,\ldots,n} z_i(m).$$

Consequently,

$$z_i(m) \le z_{l_1}(t_k) < z_{l_1+1}(t_k) \le z_{l_2}(t_k) < z_{l_2+1}(t_k) \le \cdots \le z_{l_r}(t_k) < z_{l_r+1}(t_k) \le z_j(m),$$

which implies that

$$(z_i(m)-z_j(m))^2 \ge (z_{l_1+1}(t_k)-z_{l_1}(t_k))^2 + (z_{l_2+1}(t_k)-z_{l_2}(t_k))^2 + \cdots + (z_{l_r+1}(t_k)-z_{l_r}(t_k))^2.$$

This proves the statement in the case when the edge $(i,j)$ is associated with
$l_1 < l_2 < \cdots < l_r$.
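The step from the ordered chain to the sum-of-squares bound uses only one elementary fact: disjoint, ordered subintervals of $[a, b]$ have total length at most $b - a$, and the square of a sum of nonnegative numbers is at least the sum of their squares. The following is a small illustrative sketch of that fact in Python; the function name `gap_squares_bound` and the sample numbers are hypothetical and not part of the paper.

```python
def gap_squares_bound(a, b, gaps):
    """Return (lhs, rhs) with lhs = (b - a)^2 and rhs = sum of squared gap lengths.

    `gaps` is a list of pairs (c, d) satisfying a <= c_1 < d_1 <= c_2 < ... <= b,
    mirroring z_i(m) <= z_{l_1}(t_k) < z_{l_1+1}(t_k) <= ... <= z_j(m).
    """
    # Verify the interleaving assumption before comparing the two sides.
    prev = a
    for c, d in gaps:
        if not (prev <= c < d):
            raise ValueError("gaps must be ordered and non-overlapping")
        prev = d
    if prev > b:
        raise ValueError("last gap must end at or before b")
    lhs = (b - a) ** 2
    rhs = sum((d - c) ** 2 for c, d in gaps)
    return lhs, rhs


# Illustrative values: a plays the role of z_i(m), b the role of z_j(m).
lhs, rhs = gap_squares_bound(-2.0, 3.0, [(-1.5, -1.0), (-0.5, 0.5), (1.0, 2.5)])
print(lhs, rhs)
```

The check raises an error if the ordering assumption fails, which is exactly the situation the "has not been crossed before time $m$" hypothesis rules out.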

Suppose now that the edge $(i,j)$ is associated with indices $l_1 < l_2 < \cdots < l_r$
as well as both the terms $z_{p_-}^2(t_k)$, $z_{p_+}^2(t_k)$. This happens when every $S_{l_i}$ and $S_{p_-}$ is
crossed for the first time by $(i,j)$, so that we can simply repeat the sequence of steps
in the previous paragraph to obtain

$$(z_i(m)-z_j(m))^2 \ge (z_{l_1+1}(t_k)-z_{l_1}(t_k))^2 + \cdots + (z_{l_r+1}(t_k)-z_{l_r}(t_k))^2 + (z_{p_+}(t_k)-z_{p_-}(t_k))^2,$$

which, since $(z_{p_-}(t_k) - z_{p_+}(t_k))^2 \ge z_{p_-}^2(t_k) + z_{p_+}^2(t_k)$, proves the statement in this case.

Suppose now that the edge $(i,j)$ with $i < j$ at time $m$ is associated with indices
$l_1 < l_2 < \cdots < l_r$ as well as the term $z_{p_-}^2(t_k)$. This happens when every $S_{l_i}$ has not
been crossed before time $m$, $S_{p_-}$ is being crossed by an edge at time $m$ and has been
crossed from the right but not from the left before time $m$. Consequently, in addition
to the inequalities $i \le l_1$, $j \ge l_r + 1$ we have the additional inequalities $i \le p_-$ while
$j \ge p_+$ (since $(i,j)$ crosses $S_{p_-}$). Because $S_{p_-}$ has not been crossed by an edge before,
we have $z_j(m) > 0$, so that

$$(z_i(m)-z_j(m))^2 \ge (z_{l_1+1}(t_k)-z_{l_1}(t_k))^2 + \cdots + (z_{l_r+1}(t_k)-z_{l_r}(t_k))^2 + z_{p_-}^2(t_k),$$

which proves the statement in this case.

The proof when the edge $(i,j)$ is associated with indices $l_1 < \cdots < l_r$ and $z_{p_+}^2(t_k)$
is similar, and we omit it. Consequently, we have now proved the desired statement
for all the terms of the form $(z_i(m) - z_j(m))^2$.
It remains to consider the terms $z_i^2(m)$. So let us suppose that the term $z_i^2(m)$
is associated with indices $l_1 < l_2 < \cdots < l_r$ as well as possibly one of $z_{p_-}^2(t_k)$, $z_{p_+}^2(t_k)$
(due to the way we defined the associations, it can never be associated with both).
Observe that we must have either $i \le l_1$ (if $S_{l_r}$ does not contain zero) or $i > l_r$ (if
$S_{l_r}$ contains zero). We suppose that it is the former case; the proof in the latter
case is similar. Thus we must have $i \le l_r < p_-$, so it is not possible for $z_i^2(m)$ to be
associated with $z_{p_+}^2(t_k)$; however, $z_{p_-}^2(t_k)$ might still be associated with it. Since
no $S_{l_s}$ has been crossed before, we have that

$$z_i(m) \le z_{l_1}(t_k) < z_{l_1+1}(t_k) \le z_{l_2}(t_k) < z_{l_2+1}(t_k) \le \cdots \le z_{l_r}(t_k) < z_{l_r+1}(t_k) \le z_{p_-}(t_k) < 0,$$

and therefore

$$z_i^2(m) \ge (z_{l_1+1}(t_k)-z_{l_1}(t_k))^2 + (z_{l_2+1}(t_k)-z_{l_2}(t_k))^2 + \cdots + (z_{l_r+1}(t_k)-z_{l_r}(t_k))^2 + z_{p_-}^2(t_k),$$

which concludes the proof. □

We now put all the pieces together and provide a proof of Theorem 1.4.



Proof. [Proof of Theorem 1.4] As in the statement of Lemma 2.9, let us choose
$t_k = 1 + k\max(T,B)$. Observe that by continuity Eq. (2.8) holds even with the
strict inequalities between the $v_i(t_k)$ replaced with nonstrict inequalities and without the
assumption that none of the $v_i(t_k)$ equal $\mu$; moreover, using the inequality

{v p _(t k )-ij,y+(vp + (t k )-fx) > , 

we have that Eq. (2.8) implies that

tfc+i— 1 - 
J2 E[(v k (m)- Vl (m)) 2 \v(t k )}+ J2 E[(v k (m)-^) 2 \ v(t k )} > -n(L n )Z(v(t k )). 

m=t h (h l l)eB{m) k£S(m) 

Because $\Delta(t)$ is decreasing and the degree of any vertex at any time is at most $d_{\max}$,
this in turn implies



m=t„ 8 (fc,i) S B(m) max(d k (m),di(m)) 4 fcgg(m) did max 

Now appealing to Eq. (2.4), we have

$$E[Z(v(t_{k+1})) \mid v(t_k)] \le \Big(1 - \frac{\Delta(t_{k+1})\,\kappa(L_n)}{32\, d_{\max}}\Big) Z(v(t_k)) + n\Delta(t_k)^2 \max(T,B)\frac{\sigma^2}{16}.$$

Taking expectations and using Lemma 2.8,

$$\limsup_{k} \Big(\frac{t_k}{\max(T,B)}\Big)^{1-\epsilon} E[Z(v(t_k)) \mid v(0)] \le \frac{2 n d_{\max} \max(T,B)\,\sigma^2}{\kappa(L_n)},$$

which implies

$$\limsup_{k}\; t_k^{1-\epsilon} E[Z(v(t_k)) \mid v(0)] \le \frac{2 n d_{\max} \max(T,B)^{2-\epsilon}\sigma^2}{\kappa(L_n)}. \qquad (2.10)$$
This is nearly the statement of the theorem; we need an argument which will allow
us to replace $t_k$ by an arbitrary $t$. We next argue that this is possible at the expense
of adding a factor of 2 to the right-hand side. Indeed, Corollary 2.2 implies that
$\|Qx\|_2 \le \|x\|_2$ for any symmetric nonnegative stochastic matrix $Q$. Consequently, for
any $t$ between $t_k$ and $t_{k+1}$, we have

$$E[Z(v(t+1)) \mid v(t)] \le Z(v(t)) + n\Delta^2(t)\sigma^2/16.$$

Taking expectations,

$$E[Z(v(t)) \mid v(t_k)] \le Z(v(t_k)) + n\Delta^2(t_k)\max(T,B)\sigma^2/16.$$

We combine this inequality with Eq. (2.10) in the following chain of inequalities:
letting $p_k$ be the largest $t_k$ which is at most $t$, and observing that for $t > 2\max(T,B) + 1$
we have $t \le 2p_k$ for $k \ge 1$, we have

$$\limsup_t\; t^{1-\epsilon} E[Z(v(t)) \mid v(0)] \le \limsup_t\; t^{1-\epsilon} E\big[E[Z(v(t)) \mid v(p_k), v(0)]\big]$$
$$\le \limsup_t\; t^{1-\epsilon} \big(E[Z(v(p_k)) \mid v(0)] + n\Delta^2(p_k)\max(T,B)\sigma^2/16\big)$$
$$\le \limsup_k\; 2p_k^{1-\epsilon} \big(E[Z(v(p_k)) \mid v(0)] + n\Delta^2(p_k)\max(T,B)\sigma^2/16\big)$$
$$\le \frac{4 n d_{\max}\max(T,B)^{2-\epsilon}\sigma^2}{\kappa(L_n)}.$$

The theorem is now proved. □
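The passage from the checkpoints $t_k$ to an arbitrary $t$ rests on the observation that $t \le 2p_k$ once $t > 2\max(T,B) + 1$. As a quick sanity check, the sketch below verifies this numerically for one illustrative period; the helper name `largest_checkpoint` and the value `period = 7` (standing in for $\max(T,B)$) are assumptions made for the example only.

```python
def largest_checkpoint(t, period):
    """Largest value of the form 1 + k*period (k >= 0) that is at most t, for t >= 1."""
    k = (t - 1) // period
    return 1 + k * period


period = 7  # hypothetical stand-in for max(T, B)

# For every t > 2*period + 1 in a finite range, check that t <= 2 * p_k.
doubling_holds = all(
    t <= 2 * largest_checkpoint(t, period)
    for t in range(2 * period + 2, 10_000)
)
print(doubling_holds)
```

The restriction on $t$ matters: for `t = 7` and `period = 7` the largest checkpoint is 1, and the doubling bound fails.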

We finally turn to the proof of our last major result, Theorem 1.3. We note
that this proof has close parallels with the arguments used in [16] to prove eigenvalue 
bounds for stochastic matrices. 

Proof. [Proof of Theorem 1.3] We will show that for any $m$,
min x 2 n + V (xi -xj) 2 > — — — 

This then implies the theorem immediately from the definition of the sieve constant. 

Indeed, we may suppose $m = 1$ without loss of generality. Suppose the minimum
in the above optimization problem is achieved by the vector $x$; let $M$ be the index
of the component of $x$ with the largest absolute value; without loss of generality, we
may suppose that the shortest path connecting 1 and $M$ is $1 - 2 - \cdots - M$ (we can
simply relabel the nodes to make this true). Moreover, we may also assume $x_M > 0$
(else, we can just replace $x$ with $-x$).

Now the assumptions that $\|x\|_2 = 1$, that $x_M$ is the largest component of $x$ in
absolute value, and that $x_M > 0$ imply that $x_M \ge 1/\sqrt{n}$, or

$$(x_1 - 0) + (x_2 - x_1) + \cdots + (x_M - x_{M-1}) \ge \frac{1}{\sqrt{n}},$$

and applying Cauchy-Schwarz,

$$M\big(x_1^2 + (x_2 - x_1)^2 + \cdots + (x_M - x_{M-1})^2\big) \ge \frac{1}{n},$$

or



x\ + (x 2 - xi) 2 H (x M - xm-i) 2 > 77- > 



□ 

We end this section with an extended remark discussing how our results may be 
ported to prove guarantees for cooperative learning in settings when nodes unpre- 
dictably enter and leave the system. 



Remark 2.10. We remark that our learning protocol is capable of coping with 
persistent node failures. Consider, for example, the scenario in which nodes sometimes 
"terminate," i.e., nodes are occasionally removed permanently from the system. Note 
that the proofs of our main theorems work by bounding how much Z(v(t)) decreases 
and the termination of a node can only decrease Z(v(t)). Thus our proofs can be easily 
extended to cover this possibility. One can assert, for example, that convergence to $\mu$
happens with probability 1 as long as at least one node does not terminate. In
fact, since termination only decreases n, we can further assert that all of the Theorems 
1-4 immediately carry over to the setting when nodes are allowed to terminate with 
only the additional proviso that at least one node does not terminate. 

Conversely, suppose that new nodes are allowed to periodically join the system. 
The decay bounds that we have derived can be applied immediately after a new node 
joins. We can assert, for example, that as long as finitely many new nodes join the 
system, convergence to $\mu$ occurs with probability 1. To derive quantitative convergence
bounds, we need some control over the values of new nodes that join the system; if
these are far from $\mu$, convergence may take an arbitrarily long time.

It seems natural to suppose that new nodes join the system with a value which lies
in $[\mu - L, \mu + U]$, where $L$ and $U$ are some lower and upper bounds. Consequently, the
addition of a new node to the system increases $Z(v(t))$ by at most $\max(L,U)^2$, after
which our decay bounds of Lemmas 2.6 and 2.7 immediately apply. Thus depending on
the time that the joins occur, we may derive convergence rate bounds. For example, if
we know that $j$ nodes join between times 1 and $k$, but none after, a version of Theorem
3 can be proven with the following conclusion:

lim sup (t-k?-E[Z(v(t)) | ,(0)] < 4nd -° 2 ^ g» +1 ™*(*. Uf + k (n + jV/16) 

The case when both joins and terminations are allowed is not straightforward; it is 
clear that convergence to $\mu$ may be impossible if, say, a node terminates after every
time it measures $\mu$ and before it contacts other nodes. However, we can nevertheless
prove analogues of Theorems 1-4 in a variety of settings. 

For example, suppose that at least one node has a measurement every T steps 
and does not terminate in the following B steps. Suppose further that nodes which 
join the network do so with values that are equal to a value of one of the nodes in the 
network. This implies that a join at most doubles Z(v(t)). Consequently, as long as 
joins happen less frequently with time - say, slower than the bounds of Theorems 2 
and 3 imply that $E[Z(v(t))]$ gets multiplied by 1/4 - convergence with probability 1 to
$\mu$ still occurs.
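The claim that a join "at most doubles $Z(v(t))$" when the new node copies an existing value can be illustrated directly, assuming $Z(v) = \sum_i (v_i - \mu)^2$ as the proofs above suggest: the added term is one of the existing summands, hence at most $Z(v)$. The helper names `Z` and `join_with_copy` and the sample values below are hypothetical, chosen only for illustration.

```python
def Z(values, mu):
    """Sum of squared deviations from mu (assumed form of Z(v))."""
    return sum((v - mu) ** 2 for v in values)


def join_with_copy(values, index):
    """Return the value list after a new node joins with a copy of values[index]."""
    return list(values) + [values[index]]


mu = 2.0
values = [1.0, 4.0, 2.5]
before = Z(values, mu)

# Try every possible node to copy and record Z after the join.
after = [Z(join_with_copy(values, i), mu) for i in range(len(values))]
print(before, after)
```

In this example the worst case is copying the node farthest from $\mu$, and even then $Z$ stays below twice its previous value.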

3. The sieve constant. The previous section showed that the scaling of our
learning protocol with network structure can be upper bounded in terms of the sieve
constant. In this section, we proceed to compute sieve constants for various common
graphs. The bounds we derive in this section may be immediately translated into
improved bounds on the performance of our learning protocol on certain classes of
graphs. For example, we show that for the complete graph $K_n$, the sieve constant
satisfies the upper bound $1/\kappa(K_n) \le cn$ for some constant $c$; this is two orders of
magnitude better than the bound $1/\kappa(G_n) \le n^3$ which follows from Theorem 1.3, and
it may be plugged in directly into the statement of Theorem 1.2 to get an improved
convergence bound for our protocol on the complete graph.

We are primarily interested in the scaling of the sieve constant with the number 
of nodes n, and correspondingly we will be satisfied to compute sieve constants to 
within a constant factor. We will extensively use the notations $f(n) = \Omega(g(n))$ and
$f(n) = \Theta(g(n))$ which mean, respectively, $f(n) \ge cg(n)$, and $cg(n) \le f(n) \le Cg(n)$,
for some constants $c, C$ not depending on $n$. We begin with a lemma which will
simplify some of the forthcoming computations. 

Lemma 3.1. Let $G = (V,E)$ be an undirected, connected graph and let $\mathcal{P}$ be the
group of automorphisms of $G$ which fix the vertex $m$. If the minimization problem

$$\min_{x \ne 0} \frac{x_m^2 + \sum_{(i,j)\in E} (1/\max(d_i,d_j))(x_i - x_j)^2}{\sum_{i} x_i^2}$$

has an optimal solution with $x_m \ne 0$ then it has a solution that is invariant under the
action of any $P \in \mathcal{P}$.

Proof. We make a few preliminary remarks before turning to the proof. Let us
define $E_{ij}$ to be the matrix with 1 in the $(i,j)$'th place and zero in all other entries. By
the Courant-Fischer theorem, the optimal value in the above optimization problem
is the smallest eigenvalue $\lambda_n$ of the matrix $B = E_{mm} + L$, where $L$ is the Laplacian
matrix of the graph $G$ with edge weights $w_{ij} = 1/\max(d_i,d_j)$. Moreover, the set of
vectors achieving the optimal value is precisely the set of all nonzero vectors in the
eigenspace corresponding to that eigenvalue.

The condition that $\mathcal{P}$ is the group of automorphisms of $G$ which preserve $m$ may
be stated as follows: $P \in \mathcal{P}$ if and only if $P$ is a permutation matrix satisfying
$Pe_m = e_m$ and $P^{-1}BP = B$. Therefore, if $x$ is an eigenvector of $B$ with eigenvalue
$\lambda$, then $Px$ is also an eigenvector of $B$ with eigenvalue $\lambda$:

$$BPx = PP^{-1}BPx = PBx = \lambda Px.$$

We turn now to the proof of the lemma. Suppose $x$ is a vector which achieves
the minimum in the above minimization problem. For any $P \in \mathcal{P}$, let $i(P)$ be the
smallest integer so that $P^{i(P)}$ is the identity permutation. Define

$$y = \sum_{P\in\mathcal{P}} \; \sum_{j=1,\ldots,i(P)} P^j x.$$

Clearly, $y$ is invariant under the action of any $P \in \mathcal{P}$. Because $x$ is an eigenvector of
$B$ corresponding to the smallest eigenvalue, so is $y$; moreover, $y \ne 0$ because $x_m \ne 0$
and every power of $P \in \mathcal{P}$ fixes $m$. Consequently, $y$ achieves the minimum of the
optimization problem. This concludes the proof. □
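The Courant-Fischer observation in the proof suggests a direct way to evaluate the optimal value numerically: build $B = E_{mm} + L$ with edge weights $1/\max(d_i,d_j)$ and take its smallest eigenvalue. The sketch below does this with NumPy for a fixed graph given as a 0/1 adjacency matrix; the function names are hypothetical, and the graph is assumed connected so that no degree is zero.

```python
import numpy as np


def weighted_laplacian(adj):
    """Laplacian of a connected graph with edge weights 1/max(d_i, d_j)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    # np.maximum.outer(deg, deg)[i, j] equals max(d_i, d_j).
    weights = adj / np.maximum.outer(deg, deg)
    return np.diag(weights.sum(axis=1)) - weights


def optimal_value(adj, m):
    """Smallest eigenvalue of B = E_mm + L, i.e. the minimum for a fixed vertex m."""
    B = weighted_laplacian(adj)
    B[m, m] += 1.0
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(B)[0])


def complete_graph(n):
    return np.ones((n, n)) - np.eye(n)


# n times the optimal value on K_n for a few sizes; by symmetry any m gives the same value.
scaled = [n * optimal_value(complete_graph(n), 0) for n in (10, 20, 40)]
print(scaled)
```

Minimizing `optimal_value(adj, m)` over all vertices `m` would then give a numerical estimate of the quantity studied in this section for a fixed graph.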

The following series of propositions gives the sieve constants of some common 
graphs up to a constant factor. 

Proposition 3.2. The sieve constant of the complete graph $K_n$ is $\Theta(1/n)$.

Proof. As a consequence of Lemma 3.1, we have two cases to consider. In the
first case $x_m = 0$, and then the value of $\kappa(K_n)$ is equal to the optimal value of the
minimization problem

$$\kappa(K_n) = \min_{\|x\|_2 = 1,\; x_m = 0} \; \frac{1}{n-1}\sum_{k<l,\; k,l\ne m} (x_k - x_l)^2 + \frac{1}{n-1}\sum_{k \ne m} x_k^2.$$

Since the second term always equals $1/(n-1)$, the best we can do is pick an $x$ that
sets the first term equal to zero; this yields the value $1/(n-1)$ for the optimal value
of the minimization program.

In the second case, $x_m \ne 0$, but then by Lemma 3.1 the value of every $x_i$ with
$i \ne m$ is the same; consequently, in this case the value of $\kappa(K_n)$ is equal to the optimal
value of the minimization problem

$$\kappa(K_n) = \min_{a^2 + (n-1)b^2 = 1} a^2 + (b - a)^2.$$

Observe that if $a \ge 1/\sqrt{3n}$, then the objective function is at least $1/(3n)$; and if
$a < 1/\sqrt{3n}$ then $(n-1)b^2 \ge (3n-1)/(3n)$ so $b^2 \ge 1/n$; in turn, this means the
objective function is at least $(1/\sqrt{n} - 1/\sqrt{3n})^2$ which is $\Omega(1/n)$. Since we can find an
$x$ which achieves an objective value of $O(1/n)$ (as we saw in the previous paragraph)
we can conclude that $\kappa(K_n) = \Theta(1/n)$. □
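The two-variable problem in the second case can also be solved in closed form numerically. Substituting $c = \sqrt{n-1}\,b$ turns the constraint into $a^2 + c^2 = 1$ and the objective into a quadratic form in $(a, c)$, so its minimum is the smallest eigenvalue of a $2 \times 2$ matrix. The following sketch, with a hypothetical function name, carries out this substitution; it is an illustration of the reduction rather than part of the original argument.

```python
import numpy as np


def reduced_complete_graph_value(n):
    """Minimum of a^2 + (b - a)^2 subject to a^2 + (n - 1) b^2 = 1.

    With c = sqrt(n - 1) * b the objective is 2a^2 - 2*s*a*c + s^2*c^2,
    where s = 1/sqrt(n - 1), and the constraint is a^2 + c^2 = 1.
    """
    s = 1.0 / np.sqrt(n - 1)
    M = np.array([[2.0, -s],
                  [-s, s * s]])
    return float(np.linalg.eigvalsh(M)[0])


for n in (10, 100, 1000):
    value = reduced_complete_graph_value(n)
    print(n, value, n * value)
```

The product `n * value` stays bounded away from zero and infinity as `n` grows, in line with the $\Theta(1/n)$ statement of the proposition.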

Proposition 3.3. The sieve constant of two complete graphs connected by an
edge (see Figure 3.1), which we denote $K_n - K_n$, is $\Theta(1/n^2)$.

Fig. 3.1. The graph $K_n - K_n$ for $n = 4$.

Proof. By Theorem 1.3, $\kappa(K_n - K_n) \ge 1/(3n^2)$. On the other hand, let $m$ be any
point in the first complete graph, and set $x_i = 0$ for all $i$ in the first complete graph
and $x_i = 1/\sqrt{n/2}$ for all $i$ in the second. This yields

$$\kappa(K_n - K_n) \le \frac{1}{n/2}\cdot\frac{1}{n/2},$$

so that $\kappa(K_n - K_n) = \Theta(1/n^2)$. □

Proposition 3.4. The sieve constant of the line graph $L_n$ is $\Theta(1/n^2)$.

Proof. Observe that by Theorem 1.3, $\kappa(L_n) \ge 1/(2n^2)$. On the other hand,
number the vertices of the line $1, \ldots, n$ from left to right, set $m = 1$ and pick
$x_i = i/n^{1.5}$. Then

$$\kappa(L_n) \le \frac{1/n^3 + n(1/n^3)}{\sum_{i=1}^{n} i^2/n^3} = O\Big(\frac{1}{n^2}\Big).$$

Consequently, $\kappa(L_n) = \Theta(1/n^2)$. □

Proposition 3.5. The sieve constant of the ring graph $R_n$ is $\Theta(1/n^2)$.

Proof. Theorem 1.3 gives $\kappa(R_n) \ge 1/(2n^2)$. Suppose $n$ is even. Number the
nodes $1, \ldots, n$ counterclockwise and pick $m = 1$. Setting



Xi = 



i/n 1 ' 5 if i = 1, . . . , n/2 
2f4 xEi = n/2 + l,...,n 



yields the upper bound 

K (R n ) < 



1/n 3 + n(l/n 3 ) 



ri/2-1 



< o 



Consequently, $\kappa(R_n) = \Theta(1/n^2)$. A similar argument proves the same conclusion
when $n$ is odd. □

Proposition 3.6. The sieve constant of the star graph $S_n$ (see Figure 3.2) is
$\Theta(1/n^2)$.



Fig. 3.2. The star graph S4. 

Proof. Theorem 1.3 gives that $\kappa(S_n) \ge 1/(2n^2)$ because the diameter is 2. On
the other hand, choose $m$ to be a leaf and set $x_m = 0$ and $x_i = 1/\sqrt{n-1}$ elsewhere.
This yields

$$\kappa(S_n) \le \frac{1}{(n-1)^2},$$

which proves this proposition. □
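The upper bounds in the propositions above all come from plugging a specific test vector into the objective $x_m^2 + \sum_{(i,j)\in E}(x_i - x_j)^2/\max(d_i,d_j)$ with $\|x\|_2 = 1$. The sketch below evaluates that objective for the star-graph and line-graph test vectors used in the proofs; the helper names (`objective`, `star`, `line`) and the choice $n = 50$ are illustrative assumptions, and vertices are indexed from 0 in the code.

```python
import numpy as np


def objective(adj, m, x):
    """x_m^2 + sum over edges of (x_i - x_j)^2 / max(d_i, d_j), after scaling x to unit norm."""
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)
    deg = adj.sum(axis=1)
    total = x[m] ** 2
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                total += (x[i] - x[j]) ** 2 / max(deg[i], deg[j])
    return float(total)


def star(n):
    """Star graph on n vertices; vertex 0 is the center."""
    adj = np.zeros((n, n))
    adj[0, 1:] = 1.0
    adj[1:, 0] = 1.0
    return adj


def line(n):
    """Line (path) graph on n vertices, numbered left to right."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj


n = 50

# Star graph: m is a leaf (index 1) with x_m = 0, all other entries equal.
x_star = np.ones(n)
x_star[1] = 0.0
star_value = objective(star(n), 1, x_star)

# Line graph: m is the left endpoint and x_i is proportional to i.
x_line = np.arange(1, n + 1) / n ** 1.5
line_value = objective(line(n), 0, x_line)

print(star_value, 1.0 / (n - 1) ** 2)
print(line_value, n ** 2 * line_value)
```

For the star graph the computed value should agree with $1/(n-1)^2$ up to rounding, and for the line graph the product $n^2 \cdot$ `line_value` stays of constant order, consistent with the $O(1/n^2)$ upper bounds.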

4. Simulations. We report here on several simulations of our learning protocol. 
These simulations confirm the broad outlines of the bounds we have derived; the
convergence to $\mu$ takes place at a rate broadly consistent with a decay in $1/t$ and the
scaling with $n$ appears to be polynomial.

Figure 4.1 shows plots of the distance from $\mu$ for the complete graph, the line
graph (with one of the endpoint nodes doing the sampling), and the star graph (with
the center node doing the sampling), each on 40 nodes. We caution that there is
no reason to believe these charts capture the correct asymptotic behavior as $t \to \infty$.
Nevertheless, based on the performance shown in these plots, we see that a linear
decay in the distance from $\mu$ with time appears to be plausible.

Intriguingly, the star graph and the complete graph appear to have very similar
performance. By contrast, the performance of the line graph is an order of magnitude
inferior to the performance of either of these; it takes the line graph on 40 nodes on
the order of 400,000 iterations to reach roughly the same level of accuracy that the
complete graph and star graph reach after about 10,000 iterations.

Fig. 4.1. The three plots show the quantity $\|v(t) - \mu\mathbf{1}\|_\infty$ as a function of the number of
iterations. The graph on the left corresponds to the complete graph, the middle graph corresponds to
the star graph, and the graph on the right corresponds to the line graph. Each graph has 40 vertices
and in each case, exactly one node is doing the measurements; in the star graph it is the center
vertex and in the line graph it is one of the endpoint vertices. The initial vector is random with
entries uniform in [0,5] in each case. Stepsize $\Delta$ is chosen to be $1/t^{1/4}$ for all three simulations.
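A simulation of this kind can be sketched in a few lines. The update rule is not restated in this section, so the code below assumes a form consistent with the weights appearing in the proofs: each node moves toward its neighbors with weight $\Delta(t)/8$ scaled by $1/\max(d_i,d_j)$, and the sampling node additionally moves toward its noisy measurement with weight $\Delta(t)/4$. That rule, the fixed graph, the single node measuring at every step, and all names and default values (`simulate`, `mu`, `sigma`, `seed`) are assumptions for illustration, not the authors' simulation code.

```python
import numpy as np


def simulate(adj, sampler, mu=2.0, sigma=1.0, steps=2000, seed=0):
    """Run an assumed averaging-plus-measurement update and return max_i |v_i - mu| per step."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(adj, dtype=float)
    n = len(adj)
    deg = adj.sum(axis=1)
    w = adj / np.maximum.outer(deg, deg)      # weight 1/max(d_i, d_j) on each edge
    v = rng.uniform(0.0, 5.0, size=n)         # random initial vector in [0, 5]
    errors = [np.max(np.abs(v - mu))]
    for t in range(1, steps + 1):
        delta = t ** -0.25                    # stepsize 1/t^{1/4}, as in the captions
        # Consensus term: sum_j w_ij (v_j - v_i) for every node i.
        new_v = v + (delta / 8.0) * (w @ v - w.sum(axis=1) * v)
        # Measurement term for the single sampling node (assumed to measure every step).
        measurement = mu + sigma * rng.standard_normal()
        new_v[sampler] += (delta / 4.0) * (measurement - v[sampler])
        v = new_v
        errors.append(np.max(np.abs(v - mu)))
    return np.array(errors)


complete_10 = np.ones((10, 10)) - np.eye(10)
trace = simulate(complete_10, sampler=0, sigma=0.5, steps=500)
print(trace[0], trace[-1])
```

Setting `sigma=0.0` removes the measurement noise, in which case every update is a convex combination of current values and $\mu$, so the recorded error cannot increase from one step to the next.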

Next, Figure 4.2 focuses on the scaling with the number of nodes $n$. The graphs
show the time until $\|v(t) - \mu\mathbf{1}\|_\infty$ decreases below a certain threshold as a function of
the number of nodes. We see scaling that could plausibly be quadratic for the line graph
and linear for the complete graph, which matches the upper bounds we have derived
in Section 3. However, we see scaling which appears linear for the star graph, which
is an order of magnitude better than the quadratic upper bound of Section 3.






Fig. 4.2. The three plots show how long it takes the quantity $\|v(t) - \mu\mathbf{1}\|_\infty$ to shrink below 1/2
starting from a random vector with entries in [0, 5]. The graph on the left corresponds to the complete
graph, the middle graph corresponds to the star graph, and the graph on the right corresponds to
the line graph. In each case, exactly one node is doing the measurements; in the star graph it is the
center vertex and in the line graph it is one of the endpoint vertices. Stepsize is chosen to be $1/t^{1/4}$
for all three simulations.

Finally, we include a simulation for the lollipop graph, defined to be a complete 
graph on n/2 vertices joined to a line graph on n/2 vertices. The lollipop graph often 
appears as an extremal graph for various random walk properties (see, for example, 
[8]). The node at the end of the stem, i.e., the node which is furthest from the 
complete subgraph, is doing the sampling. The scaling with the number of nodes is 
considerably worse than for the other graphs we have simulated here. 

Fig. 4.3. The plot on the left ("Distance to mu vs. iteration number") shows $\|v(t) - \mu\mathbf{1}\|_\infty$
as a function of the number of iterations for the lollipop graph on 40 nodes; the plot on the right
("Time to reduce distance to mu vs. number of nodes") shows the time until $\|v(t) - \mu\mathbf{1}\|_\infty$ shrinks
below 0.5 as a function of the number of nodes $n$. In each case, exactly one node is performing the
measurements, and it is the node farthest from the complete subgraph. The starting point is a
random vector with entries in [0,5] for both simulations, and the stepsize is $1/t^{1/4}$.

Finally, we emphasize that the learning speed also depends on the precise location
of the node doing the sampling within the graph. While our results in this paper
bound the worst case performance over all choices of sampling node, it may very well
be that by appropriately choosing the sensing nodes, better performance relative to
our bounds and relative to these simulations can be achieved.

5. Conclusion. We have proposed a model for cooperative learning by multi-agent
systems facing time-varying connectivity and intermittent measurements. We
have presented a protocol capable of learning an unknown vector from independent
measurements in this setting and provided quantitative bounds on its learning speed.
Crucially, these bounds have a dependence on the number of agents n which grows 
only polynomially fast, leading to reasonable scaling for our protocol. The sieve 
constant of a graph, a new measure of connectivity we introduced, played a central 
role in our analysis. 

Our research points to a number of intriguing open questions. Our results are 
for undirected graphs and it is unclear whether there is a learning protocol which 
will achieve similar bounds (i.e., a learning speed which depends only polynomially 
on $n$) on directed graphs. It appears that our bounds on the sieve constant given
in Theorem 4 are loose by an order of magnitude (when compared with examples in
Section 3) so that the learning speeds we have presented in this paper could potentially
be further improved. In particular, comparing the scalings derived in Section 3 with
the worst-case bound on the sieve constant given by Theorem 1.3 raises the possibility
that Theorem 1.3 may not be tight. Moreover, it is further possible that a different
protocol provides a faster learning speed compared to the one we have provided here. 

Finally, and most importantly, it is of interest to develop a general theory of
decentralized learning capable of handling situations in which complex concepts need
to be learned by a distributed network subject to time-varying connectivity and
intermittent arrival of new information. Consider, for example, a group of UAVs all of
which need to learn a new strategy to deal with an unforeseen situation, for example,
how to perform formation maintenance in the face of a particular pattern of turbulence.
Given that selected nodes can try different strategies, and given that nodes
can observe the actions and the performance of neighboring nodes, is it possible for
the entire network of nodes to collectively learn the best possible strategy? A theory
of general-purpose decentralized learning, designed to parallel the theory of PAC
(Probably Approximately Correct) learning in the centralized case, is warranted.